fix: extract Express.js anonymous route handler callbacks by joshbouncesecurity · Pull Request #49 · knostic/OpenAnt

joshbouncesecurity · 2026-05-04T18:42:56Z

Summary

The JS/TS parser pipeline did not extract anonymous arrow function callbacks used as Express route handlers, so most application code in a typical Express app was invisible to analysis.

This PR adds detection in typescript_analyzer.js:

Recognises Express-style call sites (<obj>.<verb>(<path>, ...callbacks) with verb in get|post|put|patch|delete|options|head|all|use).
Filters the receiver to plausibly-Express identifiers (app, router, routes, server, plus chained .route(...)/.Router() calls) so generic .get(...) calls on caches/clients/query-builders aren't misread as routes.
Extracts anonymous arrow / function expressions in callback positions as units, marking the last as route_handler (with is_entry_point: true) and earlier callbacks as route_middleware.
Adds metadata: http_method, http_path, callback_index, named_middleware.
Emits call-graph edges from each anonymous callback to named middleware identifiers in the same call (e.g. authenticateToken); dependency_resolver.js merges these "explicit" edges with the body-text regex edges so reachability/upstream-deps see the relationship.

unit_generator.js surfaces the new fields on each unit (route, is_entry_point, http_method, http_path, callback_index). Existing route_handlers detected by the per-function classifier still flow through unchanged.

Addresses #21 (does not close — verify against the user's reported repo first).

Test plan

tests/parsers/javascript/test_express_route_handlers.py covers:
- anonymous handler with named middleware: extraction, is_entry_point=true, http_method/http_path, call-graph edge to authenticateToken
- handler with no middleware: clean unit with no extra edges
- router.use('/api', anonMw1, anonMw2, anonHandler): one route_handler + two route_middleware
- non-Express myCache.get('foo', () => {}) and queryBuilder.post(...): nothing extracted
- named-only router.get('/x', namedHandler): no anonymous unit synthesised; named function still extracted by the regular path
Existing JS parser tests (tests/test_js_parser.py) still pass.
Full suite: pytest tests/ -> 96 passed, 12 skipped, 3 xfailed (preexisting Windows-only xfails).
Manual: parse a small Express app — anonymous route handler bodies appear as units in dataset.json, is_entry_point: true, route populated.

The JS/TS parser pipeline did not extract anonymous arrow function callbacks used as Express route handlers, so most application code in a typical Express app was invisible to analysis. This change adds detection in typescript_analyzer.js: - Recognises Express-style call sites (`<obj>.<verb>(<path>, ...callbacks)` with verb in `get|post|put|patch|delete|options|head|all|use`). - Filters the receiver to plausibly-Express identifiers (`app`, `router`, `routes`, `server` and chained `.route(...)`/`.Router()`) so generic `.get(...)` calls on caches/clients aren't misread as routes. - Extracts anonymous arrow / function expressions in callback positions as units, marking the last as `route_handler` (with `is_entry_point: true`) and earlier callbacks as `route_middleware`. - Adds metadata: `http_method`, `http_path`, `callback_index`, `named_middleware`. - Records explicit call-graph edges from each anonymous callback to named middleware identifiers in the same call (e.g. `authenticateToken`); dependency_resolver.js merges these with the body-text regex edges so reachability/upstream-deps work. - unit_generator.js surfaces the metadata on the unit (`route`, `is_entry_point`, `http_method`, `http_path`, `callback_index`). Addresses #21. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Express anonymous route handler extraction added synthetic entries to `data["functions"]` but not to `data["callGraph"]`, breaking the existing invariant `len(callGraph) == len(functions)` exercised by `test_js_parser.TestTypeScriptAnalyzer::test_builds_call_graph`. Emit a callGraph entry for each synthesised route_handler / route_middleware unit, capturing inline call expressions from the callback body. Named middleware identifiers continue to flow through `explicitCalls` and are merged downstream by `dependency_resolver.js`. Add a regression test asserting the callGraph/functions invariant for the synthetic units.

Adds four regression tests for `_extractExpressRouteCallbacks`: - TypeScript-annotated callbacks `(req: Request, res: Response) => {...}` parse correctly and produce the expected synthetic handler unit. - `app.get('/' + prefix, handler)` (dynamic path) is skipped without throwing — confirms the StringLiteral check is a hard gate. - `app.use((req, res, next) => {...})` with no path produces a single unit with http_path=null and http_method='USE'. - `app.get(path, anonMw, namedHandler)` (anon middleware before named handler): the anon callback gets a route_middleware unit with named_middleware=['namedHandler'], and the named handler is left to the regular extractor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joshbouncesecurity · 2026-05-04T20:46:27Z

Manual verification

Sample Express file:

const router = express.Router();
router.post('/orders', authenticateToken, async (req, res) => {
  const { productId } = req.body;
  // ...
});

openant parse <repo>: dataset.json contains a unit for the anonymous POST /orders handler with is_entry_point: true, http_method: "POST", http_path: "/orders".
Call graph: edge from the synthesized handler to authenticateToken.
app.get('/users', (req, res) => res.json([])): extracted with no extra middleware edges.
router.use('/api', limiter, auth, async (req, res, next) => {...}) (multi-middleware, all anonymous): one route_handler unit + two route_middleware units.
False-positive guard: myCache.get('foo', () => {}) and queryBuilder.post('users', () => {}): NOT extracted (correctly skipped — non-Express receivers).
Named handler: router.get('/x', namedHandler): no anonymous unit synthesized; namedHandler extracted by the existing path.
TS-typed callback: (req: Request, res: Response): Promise<void> => {...}: extracted correctly with metadata.
Dynamic path: app.get('/' + prefix, ...) (path is not a string literal): correctly skipped, no crash.
On the issue's reported repo (Express app with 7 vulnerabilities): unit count grows; reachability_filter retains the route handlers.

joshbouncesecurity · 2026-05-04T21:04:24Z

Local test results

Built a tiny inline Express fixture (1 file, 13 lines) and ran the JS pipeline (typescript_analyzer + unit_generator) from this branch.

Fixture (orders.js):

const express = require('express');
const router = express.Router();

function authenticateToken(req, res, next) { return next(); }

router.post('/orders', authenticateToken, async (req, res) => {
  const { productId } = req.body;
  res.json({ ok: true, productId });
});

module.exports = router;

Commands run:

node typescript_analyzer.js <repo> --files-from list.txt --output analyzer_output.json
node unit_generator.js analyzer_output.json --output dataset.json

Outcome (per checklist item):

Anonymous POST /orders handler appears as a unit (orders.js:express(POST:/orders:8:1)) ✅
is_entry_point: true on the synthesized handler ✅
HTTP method and path are present — but on the unit's route field as route.method = 'POST' / route.path = '/orders' (the top-level http_method / http_path / callback_index shown in the checklist are not populated on the unit; they may be intentionally consumed only at intermediate stages or there may be a small surfacing gap in unit_generator.js) ⚠️
Call-graph edge from the anonymous handler to authenticateToken is present in metadata.direct_calls = ['orders.js:authenticateToken']; the named middleware also lists the handler as a caller in its direct_callers ✅
route.middleware = ['authenticateToken'] correctly captures the named middleware ✅
Did not separately exercise the multi-middleware, false-positive guard (myCache.get, queryBuilder.post), TS-typed callback, dynamic-path skip, or the originally-reported repo on this run — those are covered by tests/parsers/javascript/test_express_route_handlers.py per the diff.

One small thing worth a look: the manual-verification checklist mentions top-level http_method / http_path / callback_index fields, but in dataset.json only route.method / route.path are populated. Either the checklist text or unit_generator.js's field-surfacing might want a small alignment.

ar7casper

Solid PR overall — the receiver filter, named-middleware-as-call-graph-edge wiring, and the callGraph invariant test are all real improvements over a naive "extract anything that looks like a callback" approach. One blocker, two medium nits below.

Blocker: anonymous middleware bodies are silently dropped from analysis.

The new route_middleware unitType isn't in ENTRY_POINT_TYPES in utilities/agentic_enhancer/entry_point_detector.py:26-31 (currently {route_handler, view_function, websocket_handler, cli_handler}), and the synth units have isEntryPoint: false. The chain:

Analyzer creates route_middleware units ✅
Entry point detector doesn't flag them (unitType not in ENTRY_POINT_TYPES, isEntryPoint false)
Reachability filter has no entry point to reach them from (Express invokes middleware via the framework, not via a code-resolvable call edge)
Filtered out before the LLM sees them.

Net effect: a route like

app.post('/x',
  async (req, res, next) => { /* anonymous middleware doing dangerous thing */ next(); },
  async (req, res) => { /* the handler */ }
);

gets a unit for the handler ✅ but the middleware body is silently dropped — even though it sees req directly and could be doing absolutely anything dangerous. That's the inverse of what #21 asks for.

One-line fix: add 'route_middleware' to ENTRY_POINT_TYPES. Or set isEntryPoint: true for all callbacks (matches the simpler-but-correct approach in #48). I'd suggest the former since you've already done the structural work to distinguish handler vs middleware — making both entry points preserves that distinction without losing analysis coverage. Worth adding a regression test that asserts the middleware unit survives the reachability filter end-to-end.

Two inline nits below (medium severity, not blockers).

Also worth flagging for follow-up (non-blocking):

route_middleware is a new unit_type the LLM enhancer prompt has no specific guidance for. The schema accepts arbitrary strings so no crash, but the LLM doesn't have a mental model for "this is middleware vs the handler." Probably fine, just noting.
You flagged the spurious self-edge (UserController.get → UserController.get) on #39 a couple days ago — extractCallsFromFunction(arg, relativePath) is called per-handler here too, same code path, same risk. Worth checking if the same pattern shows up in this PR's output.

ar7casper · 2026-05-12T11:41:57Z

+   * `.get(...)` calls on caches / clients / query-builders aren't misread
+   * as routes.
+   */
+  _isPlausibleExpressReceiver(receiver) {


The receiver filter has known false negatives for common naming conventions: codebases that use single-word identifiers like web, api, endpoints, controller for the Express app/router get no extraction at all, since those names don't end in app/router/routes/server. From a quick scan of popular Express apps in the wild, web and api are not unusual choices.

Suggestions (non-blocking):

Either expand the suffix list (add endpoints, controller, web, api — or treat any short identifier as a candidate and rely on the verb-set + string-path filter to reject false positives), or

Document the assumption in the JSDoc and the route_handler extraction docstring so reviewers / users understand why their app might silently produce zero route units.

Fixed in f869c97. Expanded the stem list to include web, api, endpoints, and controller, and consolidated both the Identifier and PropertyAccessExpression branches to share a single static EXPRESS_RECEIVER_STEMS field so they can't drift out of sync. Updated the JSDoc to document the remaining coverage boundary (identifiers outside the stem list will produce zero units).

ar7casper · 2026-05-12T11:41:57Z

+        } else if (k === "PropertyAccessExpression") {
+          // e.g. middleware.auth — keep the trailing name
+          const name = arg.getName ? arg.getName() : arg.getText();
+          namedMiddleware.push(name);


For PropertyAccessExpression args (app.post('/x', middleware.auth, ...)), this stores just the trailing name ("auth"), which _resolveCall will then look up in the global function map. If there's any other function called auth anywhere in the codebase (e.g. someClass.auth), _resolveCall may resolve to the wrong one — silent false-positive call graph edge.

Suggestions (non-blocking):

Store the full arg.getText() (e.g. "middleware.auth") and let dependency_resolver's existing _resolveMethodCall handle the property-access pattern.

Or, if you want to keep the simple-name semantics, document the limitation and add a test exercising the ambiguous-name case so the behavior is intentional.

Documented with a comment explaining the known limitation — trailing-name-only storage (e.g. "auth" from "middleware.auth") can resolve to the wrong target if another function shares the same simple name. Chose documentation over a dual-file refactor (also touching dependency_resolver.js) as the simpler, lower-risk path for a non-blocking nit.

…filter Blocker: add 'route_middleware' to ENTRY_POINT_TYPES so anonymous Express middleware bodies are not silently dropped by the reachability filter. Add regression tests for EntryPointDetector that assert both route_handler and route_middleware units survive as entry points. Non-blocking nit 1: expand _isPlausibleExpressReceiver to accept common single-word app/router identifiers (web, api, endpoints, controller) that were previously missed, causing zero route units to be emitted for those codebases. Consolidate the stem list into a static class field so both the Identifier and PropertyAccessExpression branches stay in sync. Non-blocking nit 2: document the known limitation that PropertyAccessExpression middleware args are stored as trailing-name only (e.g. "auth" from "middleware.auth"), which can produce false-positive call-graph edges when another function shares the same simple name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

joshbouncesecurity · 2026-05-12T15:18:03Z

Actioned all review comments in f869c97 — summary of what was fixed:

Blocker: route_middleware silently dropped
Added 'route_middleware' to ENTRY_POINT_TYPES in entry_point_detector.py. Anonymous Express middleware bodies now survive the reachability filter and are passed to the LLM. Added a regression test in tests/test_entry_point_detector.py asserting both route_handler and route_middleware units are detected as entry points end-to-end.

Nit 1: receiver filter false negatives
Expanded _isPlausibleExpressReceiver to cover web, api, endpoints, and controller as accepted stems. Consolidated the stem list into a single static EXPRESS_RECEIVER_STEMS field shared by both the Identifier and PropertyAccessExpression branches so they stay in sync. Updated JSDoc to document what's still out of scope.

Nit 2: PropertyAccessExpression trailing-name ambiguity
Documented the known limitation with an inline comment rather than the more invasive dual-file refactor. The comment explains that storing only the trailing name (e.g. "auth" from "middleware.auth") can produce a false-positive call-graph edge if another function shares the same simple name.

Self-edge concern (checked, no change)
extractCallsFromFunction when called on an ArrowFunction node calls .getDescendantsOfKind(funcNode.getKind()) — which returns nested arrow functions — then filters for CallExpression, producing an empty array. No calls are emitted from this path, so spurious self-edges cannot occur here. The implementation is effectively a placeholder; noted but out of scope for this PR.

joshbouncesecurity · 2026-05-12T15:18:51Z

@ar7casper responded

ar7casper

Round-2 review — thanks for actioning all 3 round-1 findings cleanly in f869c97.

Verified each fix:

Round-1	Status
🟠 Blocker — `route_middleware` not in `ENTRY_POINT_TYPES`	✅ Added at `entry_point_detector.py:28` + new regression test in `test_entry_point_detector.py`
🟡 Medium — Receiver filter false negatives	✅ Added `web/api/endpoints/controller` to `EXPRESS_RECEIVER_STEMS` static field (consolidated for both Identifier and PropertyAccess branches)
🟡 Medium — Named-middleware trailing-name	⚖️ Documented as known limitation at `typescript_analyzer.js:406-410`. Reasonable call — fix scope was non-trivial. Comment names the failure mode for future readers.

CI all green on Ubuntu/macOS/Windows. Solid PR.

One Medium worth raising for awareness (not blocking — it's pre-existing) and two housekeeping Lows:

🟡 Medium — JS pipeline doesn't normalize `unitType` → `unit_type` (pre-existing, not introduced by this PR)

The new regression test (test_route_middleware_is_entry_point) passes in isolation because it constructs func_data with unit_type (snake_case) directly. But in the JS production pipeline the chain is:

typescript_analyzer.js:446 emits unitType: "route_middleware" (camelCase)
parsers/javascript/test_pipeline.py:497 reads functions = analyzer.get("functions", {}) — no normalization
entry_point_detector.py:176 reads func_data.get('unit_type', '') (snake_case) — misses the camelCase key
unit_type Check 1 silently returns no match for ALL JS units
Only Check 3 (input patterns: req.body, req.params, etc. in code body) actually fires

For typical Express middleware that touch req.body/req.headers, Check 3 catches them — practical effect is fine. But for minimal middleware like async (req, res, next) => next(); (rare but possible), Check 3 doesn't match and the unit gets dropped despite this PR's ENTRY_POINT_TYPES fix.

This is the documented bug per parser-issues #23-26 + Room-for-Improvement #10 — predates this PR. Not blocking for merge, but worth noting because the regression test gives somewhat-false confidence: the structural piece is correct, but the production effect for JS depends on the broader normalization work.

Suggestions (any one, all non-blocking):

Most thorough: fix the upstream parser-issues #23-26 in a separate PR — normalize JS analyzer output to use snake_case before passing to EntryPointDetector.
PR-scoped: add an end-to-end test that runs the full JS pipeline (typescript_analyzer.js → unit_generator.js → EntryPointDetector reading the actual analyzer output) and asserts a route_middleware unit with no req.X body references survives the reachability filter. Would surface the gap concretely.
Minimum: acknowledge in PR body that the structural fix is complete but practical effect for JS middleware depends on the broader unit_type normalization work.

🟢 Low — PR body stale

🟢 Low — New receiver patterns lack test coverage

The new stems (web, api, endpoints, controller) work as designed, but no test exercises them. Existing tests use router and app. A future refactor that breaks the regex for one of the new patterns wouldn't be caught.

Suggestion: parameterize the existing happy-path test, or add 1-2 new fixtures with web.get(...) and api.post(...) to lock in the expanded behavior.

Verdict: ✅ ready to merge. Approving via UI.

Covers the seven PRs in this release: - #35: parse --level default → reachable (CLI consistency with scan + Python CLI) - #36: auto-detect dep changes via ~/.openant/venv/.deps-hash - #37: lazy JS parser npm bootstrap on first use - #39: TypeScript/NestJS DI-aware call resolution (constructor + field + functional inject()) - #40: --language auto opt-in for openant init + non-git path support + shared config/languages.json - #49: Express anonymous route handler extraction (route_handler / route_middleware) - #50: --llm-reachability opt-in stage + cross-parser call_graph.json contract Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joshbouncesecurity and others added 3 commits May 4, 2026 21:42

ar7casper reviewed May 12, 2026

View reviewed changes

joshbouncesecurity marked this pull request as ready for review May 12, 2026 15:18

joshbouncesecurity requested review from dgeyshis, shahar-davidson, sounil and yotamleo as code owners May 12, 2026 15:18

joshbouncesecurity changed the base branch from master to release/2026-05-12 May 12, 2026 16:53

ar7casper reviewed May 13, 2026

View reviewed changes

ar7casper merged commit 6223d86 into knostic:release/2026-05-12 May 13, 2026
9 checks passed

ar7casper mentioned this pull request May 13, 2026

feat(Release): 2026-05-12 #58

Open

3 tasks

This was referenced May 13, 2026

Normalize unit_type field shape across parsers so EntryPointDetector's unit-type check works for JS/Go #63

Open

fix(parsers/javascript): extract Express anonymous route handler callbacks (#21) #48

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: extract Express.js anonymous route handler callbacks#49

fix: extract Express.js anonymous route handler callbacks#49
ar7casper merged 4 commits into
knostic:release/2026-05-12from
joshbouncesecurity:fix/issue21-express-handlers

joshbouncesecurity commented May 4, 2026

Uh oh!

joshbouncesecurity commented May 4, 2026

Uh oh!

joshbouncesecurity commented May 4, 2026

Uh oh!

ar7casper left a comment

Uh oh!

ar7casper May 12, 2026

Uh oh!

joshbouncesecurity May 12, 2026

Uh oh!

ar7casper May 12, 2026

Uh oh!

joshbouncesecurity May 12, 2026

Uh oh!

joshbouncesecurity commented May 12, 2026

Uh oh!

joshbouncesecurity commented May 12, 2026

Uh oh!

ar7casper left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joshbouncesecurity commented May 4, 2026

Summary

Test plan

Uh oh!

joshbouncesecurity commented May 4, 2026

Manual verification

Uh oh!

joshbouncesecurity commented May 4, 2026

Local test results

Uh oh!

ar7casper left a comment

Choose a reason for hiding this comment

Uh oh!

ar7casper May 12, 2026

Choose a reason for hiding this comment

Uh oh!

joshbouncesecurity May 12, 2026

Choose a reason for hiding this comment

Uh oh!

ar7casper May 12, 2026

Choose a reason for hiding this comment

Uh oh!

joshbouncesecurity May 12, 2026

Choose a reason for hiding this comment

Uh oh!

joshbouncesecurity commented May 12, 2026

Uh oh!

joshbouncesecurity commented May 12, 2026

Uh oh!

ar7casper left a comment

Choose a reason for hiding this comment

🟡 Medium — JS pipeline doesn't normalize unitType → unit_type (pre-existing, not introduced by this PR)

🟢 Low — PR body stale

🟢 Low — New receiver patterns lack test coverage

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

🟡 Medium — JS pipeline doesn't normalize `unitType` → `unit_type` (pre-existing, not introduced by this PR)